Abstract

Inclusion of irrelevant variables in a cluster analysis adversely affects subgroup recovery. 
This paper examines using moment-based statistics to screen variables; only variables which pass
the screening are then used in clustering. Normal mixtures are analytically shown often to
possess negative kurtosis. Two related measures, m and the coefficient of bimodality b, are also
examined.

A Monte Carlo study compared the screening measures to no selection, De Soete's 
(1988) ultrametric weights, and Fowlkes, Gnanadesikan, and Kettenring's (1988) forward 
selection procedure. Screening based on kurtosis degraded recovery and is not recommended. 
In contrast, screening on m or on b improved recovery over both no selection and forward 
selection, and screening performed as well as ultrametric weights. Combining screening with 
ultrametric weights performed extremely well. All methods were found to be somewhat sensitive 
to other types of error. 

Screening variables appears a viable alternative to both ultrametric weights and forward 
selection. The potential advantages and disadvantages of screening are considered. 

Keywords: Variable selection; Cluster analysis of two-mode data; Kurtosis; Hierarchical 
clustering; Euclidean distances. 
1. INTRODUCTION

Applications of cluster analysis commonly involve trying to isolate relatively 
homogeneous subgroups of individuals from a collection of entities hitherto thought to be 
homogeneous. Thus, this paper adopts the view of cluster analysis as the attempt to "unmix a 
mixture of distributions" (e.g., Titterington, Smith, & Makov, 1985; McLachlan & Basford, 1988). 
Clusters are the homogeneous distributions which are mixed, and applications of cluster analysis
attempt to identify relatively homogeneous subgroups within a more heterogeneous population. 

The first step in such an analysis is to select the necessary entities and variables. Meehl 
(1979) emphasized the use of clinical insight into the domain of interest, and standard sources 
on cluster analysis such as Everitt (1980), Lorr (1983), and Aldenderfer and Blashfield (1984)
merely state that the variables should be theoretically relevant. Yet cluster analysis is useful as
an exploratory technique: the domain of interest may be known, but the specific variables which
separate putative subgroups are not known prior to the analysis.
1.1 The Problem of Irrelevant Variables 

The usual response of applied researchers is to include all possible variables, in the hope 
that the dimensions upon which subgroups differ will be represented by one or more of these 
variables. Unfortunately, such a shotgun strategy is counter-productive. In the process of 
clustering, the two-mode (variables by entities) multivariate data are converted to a single-mode 
(entities by entities) univariate similarity measure, such as Euclidean distance or Q-correlation. 
Including irrelevant variables acts to introduce noise into the similarity measure, obscuring 
subgroup structure. Everitt (1980) reports that algorithms such as single and centroid linkage 
produced similar results when used with similarity data containing error as they did when used 
to cluster unimodal data. This renders such methods effectively useless, as it is impossible to
interpret such a solution (Donoghue, 1987). Milligan (1980) found that the addition of
irrelevant dimensions resulted in the lowest recovery rates for all of the clustering algorithms he 
studied. He concludes that "a researcher should be particularly cautious when selecting variables
to be used in the clustering process." (1980, p. 341). 

The deleterious effect of irrelevant variables is aggravated by attempts to deal with 
other problems. Fleiss and Zubin (1969) noted that standardization of variables (to remove 
effects of variable scale) has the effect of decreasing between-groups spread compared to those 
variables which do not contain subgroups. This implicitly assigns larger weights to variables 
which do not measure the between-groups difference, making the subgroups harder to isolate. 
Simulations by Milligan and Cooper (1988) and Barton (1993) have found that standardizing 
variables can adversely affect recovery by cluster methods. Attempts to deal with problems 
caused by computing Euclidean distances from non-orthogonal variables (e.g., Donoghue, 1993) 
produce similar problems. Hartigan (1975) reports decreased recovery when using Mahalanobis 
distance, and Rohlf (1970) and Chang (1983) discuss problems in clustering based upon principal 
components scores. However, techniques developed by Art, Gnanadesikan, and Kettenring 
(1982) to estimate the pooled within-groups covariance matrix may alleviate this problem 
(Donoghue, 1994). In addition, clustering procedures recently have been proposed which 
combine multidimensional scaling and/or variable weighting with specific clustering algorithms 
(De Soete, DeSarbo, & Carroll, 1985; DeSarbo, Carroll, Clark, & Green, 1984; DeSarbo, 
Howard, & Jedidi, 1991). 
1.2 Methods to Deal with Irrelevant Variables

A few general suggestions have appeared which address the problem of irrelevant 
variables. Unlike those just cited, these methods are not tied to the clustering algorithm used, 
and so are applicable across a variety of clustering algorithms. Fowlkes, Gnanadesikan, and
Kettenring (1988) suggested a forward selection procedure to determine which variables to 
include in a cluster analysis. At each step, their method selects the variable which maximizes 
Pillai's trace criterion from MANOVA. For each analysis, expected values of the statistic are
obtained by Monte Carlo methods, i.e., 100 draws of n entities from a spherical, p-dimensional
normal distribution. Forward selection stops when the increase in the trace statistic is less than 
the expected value. The method is computationally intensive, with the amount of computation 
increasing with the square of the number of variables. 

Milligan (1989) examined the use of a variable weighting procedure (De Soete, 1986, 
1988) to deal with irrelevant variables. The method selects weights such that the distances 
computed from the weighted variables maximally satisfy the ultrametric inequality: 

$$d_{ij} \le \max(d_{ik}, d_{jk}).$$

This is equivalent to requiring that all sets of three points lie on an acute isosceles (or 
equilateral) triangle. Johnson (1967) and Milligan (1979) demonstrated the relationship 
between the ultrametric inequality and many commonly used hierarchical clustering algorithms.
Milligan (1989) found that using the ultrametric weights improved cluster recovery when the 
data contained one, two, or three irrelevant dimensions. The amount of computation for this 
method increases with the cube of the number of entities; Milligan reports that the method 
required too much computation to complete an additional simulation condition in which datasets
contained 250 entities. 
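
The ultrametric inequality can be checked directly on a distance matrix. The following is a minimal sketch, not De Soete's weighting algorithm: it only counts the triples of entities that violate the inequality, using the fact that a triple is ultrametric exactly when its two largest distances are equal. The function name and the toy matrices are hypothetical.

```python
from itertools import combinations


def ultrametric_violations(d, tol=1e-9):
    """Count triples (i, j, k) that break d_ij <= max(d_ik, d_jk).

    `d` is a symmetric matrix of distances given as a list of lists.
    The inequality holds for every ordering of a triple exactly when
    the two largest of its three distances are equal (isosceles with
    a short base, or equilateral).
    """
    violations = 0
    for i, j, k in combinations(range(len(d)), 3):
        sides = sorted([d[i][j], d[i][k], d[j][k]])
        if sides[2] - sides[1] > tol:
            violations += 1
    return violations


# Hypothetical toy matrices (illustration only).
ultra = [[0, 1, 3],
         [1, 0, 3],
         [3, 3, 0]]
non_ultra = [[0, 1, 2],
             [1, 0, 3],
             [2, 3, 0]]

print(ultrametric_violations(ultra), ultrametric_violations(non_ultra))
```

Because every triple is examined, even this check grows with the cube of the number of entities, which is in line with the computational burden noted above.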
1.3 Moment-based Variable Screening

Some researchers have suggested screening variables based upon the shape of the
distribution. For example, Morris et al. (1981, cited in Fletcher & Satz, 1985) have noted that 
normally distributed variables are not consistent with the presence of subgroups in the sample,
and Fletcher and Satz assert that the distribution of the variables must be skewed (1985, p. 49) 
in order for the variables to be consistent with the presence of subgroups. While this paper was 
in preparation, a study by Bajgier and Aggarwal (1991) was published which compared the 
power of a variety of univariate distributional tests to detect balanced mixtures. Testing for 
negative kurtosis was the most powerful of the methods they examined. 

In this paper, three variable screening strategies are developed and examined. Section 2 
examines the meaning of univariate kurtosis, and its relationship to bimodality. Section 3 
examines the kurtosis, and two improvements to the kurtosis, the m-index and the coefficient of
bimodality b. Section 4 gives the design of a simulation study to examine these methods. 
Section 5 reports the results of the simulation, and compares the screening measures to two 
alternatives from the literature. Finally, Section 6 contains discussion of the potential 
advantages and disadvantages of screening, and Section 7 presents suggestions for further work. 

2. THE DISTRIBUTION OF A MIXTURE 
Mixtures are expected to have multiple modes corresponding to the individual subgroups. 
Finucan (1964, p. 112) noted, "a bimodal curve in general has also a strong negative kurtosis." A 
series of notes in The American Statistician also suggest this (Darlington, 1970; Chissom, 1970;
Hildebrand, 1971), but they also point out that kurtosis is not necessarily negative for bimodal 
distributions. In addition, Eisenberger (1964) has examined the conditions under which a 
mixture of two normal distributions will be bimodal or unimodal. Distribution B in Figure 1 is
such a unimodal mixture of two normal distributions. In 1939, Fisher asserted that distributions 
such as B had a lower kurtosis than distribution A, the standard normal distribution. Finucan 
(1964) proved this assertion, as have others from different points of view (Marsaglia, Marshall,
& Proschan, 1965; Ali, 1974). As a result, kurtosis is often interpreted as a measure of whether
the distribution is sharply peaked or flattened out compared to the normal distribution. Yet, 
Kaplansky (1945) demonstrated that the kurtosis need not be related to the distribution's 
peakedness, and Ali (1974) and Johnson, Tietjen, and Beckman (1980) have argued that kurtosis
is better conceived of as a measure of the thickness of the tails of the distribution. Distributions 
which have thicker tails than the normal take on positive values of kurtosis; those with flatter 
tails take on negative values. Balanda and MacGillivray's (1988) review concluded that "it is 
best to define kurtosis vaguely as the location- and scale-free movement of probability mass 
from the shoulders of a distribution into its center and tails, and to recognize that it can be 
formalized in many ways." (p. 111)



A mixture of normal distributions may be unimodal or bimodal. In some cases unimodal 
mixtures of normals can have lower kurtosis than a single normal of equal mean and variance. 
Bimodal distributions generally have negative kurtosis, although not always. Hence, there
appears to be some connection between negative kurtosis and mixtures. Thus, we next consider 
the kurtosis of a mixture. 



Insert Figure 1 about here

3. KURTOSIS OF A MIXTURE

The measure of kurtosis, $g_2$, is the fourth moment about the mean normalized by the
variance squared, and compared to the normal's normalized fourth moment (which is 3):

$$g_2 = \frac{M_4}{M_2^2} - 3, \qquad (1)$$

where $M_k$ is the kth moment about the mean. The kurtosis of a mixture of normal distributions
is given by:

$$g_{2m} = \frac{3\,\mathrm{var}\left[\sigma_j^2 + (\mu_j - \mu)^2\right] - 2 \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^4}{\left[\sum_{j=1}^{G} \pi_j \left(\sigma_j^2 + (\mu_j - \mu)^2\right)\right]^2}, \qquad (2)$$




where var is the variance over subgroups and $\pi_j$ is the proportion in subgroup $j$ (the derivation
of (2) is given in the Appendix). There are three competing processes working to determine the 
kurtosis of a mixture. Heterogeneity of the within-group variances inflates the kurtosis, as do 
differences in the sizes of the subgroups. Differences in the subgroup means work to decrease 
the kurtosis. 
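
The three competing processes can be explored numerically. The sketch below assumes that (2) has the form stated in the docstring, with var taken as the proportion-weighted variance over subgroups, and compares it with the kurtosis obtained directly from the central moments of a normal mixture. Function names and example parameters are hypothetical.

```python
def mixture_kurtosis(props, means, variances):
    """Kurtosis of a normal mixture via the decomposition assumed for (2):
    (3 * var[a_j] - 2 * sum_j pi_j (mu_j - mu)^4) / (sum_j pi_j a_j)^2,
    where a_j = sigma_j^2 + (mu_j - mu)^2 and var is weighted by pi_j.
    """
    mu = sum(p * m for p, m in zip(props, means))
    a = [v + (m - mu) ** 2 for m, v in zip(means, variances)]
    m2 = sum(p * x for p, x in zip(props, a))              # mixture variance
    var_a = sum(p * (x - m2) ** 2 for p, x in zip(props, a))
    fourth = sum(p * (m - mu) ** 4 for p, m in zip(props, means))
    return (3 * var_a - 2 * fourth) / m2 ** 2


def mixture_kurtosis_direct(props, means, variances):
    """Same quantity from the mixture's second and fourth central moments."""
    mu = sum(p * m for p, m in zip(props, means))
    m2 = sum(p * (v + (m - mu) ** 2)
             for p, m, v in zip(props, means, variances))
    m4 = sum(p * (3 * v * v + 6 * v * (m - mu) ** 2 + (m - mu) ** 4)
             for p, m, v in zip(props, means, variances))
    return m4 / m2 ** 2 - 3


# Mean differences only: kurtosis is pushed below zero.
separated = mixture_kurtosis([0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
# Variance heterogeneity only: kurtosis is pushed above zero.
heterogeneous = mixture_kurtosis([0.5, 0.5], [0.0, 0.0], [1.0, 4.0])
print(separated, heterogeneous)
```

The two routes agree because the fourth central moment of each normal component, taken about the mixture mean, equals $3[\sigma_j^2 + (\mu_j - \mu)^2]^2 - 2(\mu_j - \mu)^4$.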

The form of (2) allows two properties to be easily demonstrated: 

A) Common Mean: When the subgroup means are identically equal to $\mu$, all of the
$(\mu_j - \mu)^2$ terms drop out, and the kurtosis of the mixture is non-negative, $g_{2m} \ge 0$.

B) Homogeneous Variances: When the subgroup variances are identically equal to
$\sigma^2$, the kurtosis is a function only of the subgroup means. In this case, $g_{2m}$ will be
less than zero, provided that the $\pi_j$ are not too dissimilar. Specifically, the kurtosis will
be negative whenever the variance of the squared differences of the means $(\mu_j - \mu)^2$
is less than two-thirds of $\sum_j \pi_j (\mu_j - \mu)^4$. While this expression has no simple,
intuitive meaning, it will be true whenever the subgroups are relatively similar in
size, for example if the ratio of the sizes is less than 3:1 in the two subgroup case.
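
As a worked check of the size-ratio remark in (B), consider two subgroups with a common variance $\sigma^2$. The notation here is introduced only for illustration and is not taken from the Appendix derivation: $\pi$ is the proportion in the first subgroup, $D$ is the distance between the two subgroup means, and $q = \pi(1 - \pi)$.

```latex
\begin{align*}
\mu_1 - \mu &= -(1-\pi)D, \qquad \mu_2 - \mu = \pi D, \\
\sum_j \pi_j (\mu_j - \mu)^2 &= qD^2, \qquad
\sum_j \pi_j (\mu_j - \mu)^4 = q(1 - 3q)D^4, \\
g_{2m} &= \frac{3\left[q(1-3q) - q^2\right]D^4 - 2q(1-3q)D^4}{(\sigma^2 + qD^2)^2}
        = \frac{q(1 - 6q)D^4}{(\sigma^2 + qD^2)^2}.
\end{align*}
```

Under these assumptions the kurtosis is negative exactly when $q > 1/6$, that is, when $0.211 < \pi < 0.789$, or a size ratio below $2 + \sqrt{3} \approx 3.73$ to 1. The 3:1 guideline quoted above is therefore slightly conservative.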

Assuming normality of within-group distributions, platykurtosis ($g_{2m} < 0$) indicates the
presence of subgroups. Unfortunately, the converse is not true. Thus, kurtosis may be used as a 
relatively stringent screening measure. The inferences which may be made (in the absence of 
sampling error) are summarized in Table 1.



Insert Table 1 about here 



The screening test based upon kurtosis may be improved by including the information
about the variable's skewness. Such a correction is particularly attractive because kurtosis is
most powerful in situations in which the overall distribution is nearly symmetric (in Table 2, cell
III and cell II when the $\pi_j$ are similar). The skewness, $g_1$, is defined as:



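$$g_1 = \frac{M_3}{M_2^{3/2}}.$$

For a mixture, this becomes:

$$g_{1m} = \frac{\sum_{j=1}^{G} \pi_j (\mu_j - \mu)^3 + 3 \sum_{j=1}^{G} \pi_j \sigma_j^2 (\mu_j - \mu)}{\left[\sum_{j=1}^{G} \pi_j \left(\sigma_j^2 + (\mu_j - \mu)^2\right)\right]^{3/2}}.$$
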

The factors of unequal within-group variances and unequal mixing proportions ($\pi_j$) induce
skewness in the mixture distribution. Thus, incorporating a correction for the skew will make 
the test more powerful. 

The kurtosis is always bounded below. This lower bound is usually given as:

$$g_2 \ge -2.$$

However, the actual lower bound (Stuart & Ord, 1987, p. 115) is:

$$g_2 + 3 \ge g_1^2 + 1. \qquad (3)$$

This suggests the index m:

$$m = g_2 - g_1^2, \qquad (4)$$

which has a uniform lower bound of -2.0 for all variables.

As with kurtosis, a plausible selection rule is to use variables for which $m \le 0$. There is
no simple expression for (4) in the presence of mixtures. However, it does have several 
desirable properties. It has an expected value of zero for a single (nonmixture) normal 
distribution. Under the condition of homogeneity of subgroup means, it reduces to the kurtosis 
of the mixture, and again has a nonnegative expected value. Similarly, under homogeneity of 
subgroup variances, the value is negative for most cases. 

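The coefficient of bimodality b in SAS (1985) also incorporates the bound in (3). In terms of
the skewness and kurtosis defined above, and ignoring small-sample corrections, it is:

$$b = \frac{g_1^2 + 1}{g_2 + 3}.$$

The coefficient is bounded, $0 < b \le 1$. The manual suggests that values "greater than 0.555 may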
indicate bimodal or multimodal" distributions (1985, p. 272). No explanation is given for this 
value, but it is the expected value of the statistic for a uniform distribution, and assumes that 
values larger than this are likely to reflect true subgroup structure (W. S. Sarle, personal 
communication, June 11, 1987). To date, no studies have investigated the efficacy or power of 
this measure. The expected value of the statistic is .333 for a single normal distribution. Large 
values of b suggest multimodality. It will take on values less than or equal to .333 when no 
mean differences are present. Also, b is more sensitive than is m; m < 0 implies b > .333. 
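
The screening statistics are simple functions of the sample moments. The sketch below is not the author's FORTRAN code: it uses plain (uncorrected) central moments, and it assumes the population form $b = (g_1^2 + 1)/(g_2 + 3)$, which reproduces the reference values of .333 for a normal and .555 for a uniform distribution quoted above. The function name is hypothetical.

```python
def moment_stats(x):
    """Return skewness g1, kurtosis g2, the m-index and the coefficient b.

    Central moments are computed with divisor n; no small-sample
    correction is applied (an assumption of this sketch).
    """
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    if m2 == 0:
        raise ValueError("variable has zero variance")
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    g1 = m3 / m2 ** 1.5
    g2 = m4 / m2 ** 2 - 3
    return {
        "g1": g1,
        "g2": g2,
        "m": g2 - g1 ** 2,                 # index m of equation (4)
        "b": (g1 ** 2 + 1) / (g2 + 3),     # assumed population form of b
    }


# A symmetric two-point variable sits at both bounds: m = -2 and b = 1.
extreme = moment_stats([-1.0, 1.0, -1.0, 1.0])
print(extreme)
```

The two-point example shows why both indices treat extreme bimodality as the limiting case: all mass sits in two equal spikes, so the lower bound in (3) is attained.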

To illustrate the behavior of $g_2$, m, and b, the expected values of each were calculated for
a variety of mixtures. Table 2 gives values for each of the measures for selected combinations 
of subgroup proportions, means and variances for mixtures of two and three subgroups, and 
Table 3 gives the values of each of the statistics for several common probability distributions. 
Insert Tables 2 and 3 about here

Table 3 illustrates a desirable property of the measure m. Its expected value is independent of 
the distributional parameters for several of the distributions examined. In the remaining cases it 
is a very simple function. This is not true of b. It is not clear whether this property is more 
important than the slightly greater sensitivity of b. 

4. SIMULATION DESIGN 
A Monte Carlo study was undertaken to systematically evaluate the effect of the proposed screening
measures on the ability of common clustering algorithms to recover a known subgroup structure.
4.1 Method 

Data were generated using a modified version of the algorithm given by Milligan (1985). 
This algorithm has been used in a number of studies. Each dataset consisted of 50 observations. 
Within a subgroup, observations were drawn from a truncated multinormal distribution, with 
observations constrained to lie within the range $\mu_j \pm 1.5\sigma_j$ in the first dimension. In addition,
subgroup boundaries were well separated in the first dimension. This ensured that there was no
overlap among the subgroups.
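
The truncation step can be illustrated with a simple rejection sampler. This is only a sketch of the idea for the first dimension, not Milligan's (1985) generation algorithm; the subgroup means, standard deviations, sizes, and function names are hypothetical.

```python
import random


def truncated_normal(mu, sigma, rng, width=1.5):
    """Draw from N(mu, sigma^2), rejecting values outside mu +/- width*sigma."""
    while True:
        x = rng.gauss(mu, sigma)
        if abs(x - mu) <= width * sigma:
            return x


def first_dimension(sizes, means, sigmas, seed=0):
    """Generate first-dimension coordinates, one list per subgroup."""
    rng = random.Random(seed)
    return [
        [truncated_normal(mu, sigma, rng) for _ in range(n)]
        for n, mu, sigma in zip(sizes, means, sigmas)
    ]


# Two hypothetical subgroups whose truncation ranges cannot overlap.
groups = first_dimension([25, 25], [0.0, 10.0], [1.0, 1.0], seed=1)
print(max(groups[0]) < min(groups[1]))
```

With means farther apart than the sum of the two truncation half-widths, the subgroups are guaranteed to be disjoint in this dimension, which is the property the design relies on.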
Design 

The chief variable of interest was the effect of the variable selection/weighting 
procedures which were applied to the datasets. Four additional factors were manipulated in the
data generation: 

1) Number of subgroups (4 levels) -- 2, 3, 4, or 5 subgroups, 

2) Number of "core" variables (3 levels) -- 4, 6, or 8. Subgroup means differed on each 

of these dimensions. 
3) Density of subgroups (3 levels):

a) equal sized subgroups, 

b) the first subgroup was 10% of observations, other subgroups were equal sized, 

c) the first subgroup was 60% of observations, other subgroups were equal sized. 
These three factors were fully crossed to yield 36 (4 X 3 X 3) conditions. Three replicate 
datasets were generated per condition, for a total of 108 base datasets. Each of the base 
datasets was then modified according to seven error conditions: 

4) Error condition 

a) No error 

b) One normally distributed noise dimension 

c) Two normally distributed noise dimensions 

d) Three normally distributed noise dimensions 

e) Error perturbed coordinates, low error, $\lambda = 1$

f) Error perturbed coordinates, high error, $\lambda = 2$

g) Outlier condition, 10 observations (i.e., 20%) which did not fall within any
subgroup were added to the dataset.

For the error perturbed coordinates condition, the normally distributed error was added to the
original coordinates:

$$E_{jik} = A_{jik} + \lambda e_{jik}, \qquad e_{jik} \sim N(0, \sigma_{jk}^2).$$

This resulted in a total of 756 datasets. See Milligan (1985) for additional details on the data 
generation and error conditions. The variables in each dataset were then weighted and/or
screened using each of the 10 methods listed below, yielding 7,560 weighted datasets. Each
weighted dataset was then analyzed by four clustering algorithms, making a total of 30,240 
clusterings. For each clustering, the solution for the correct number of subgroups was used as 
the result for that method. 
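
The size of the design follows from multiplying out the factor levels described above. The short tally below reproduces the stated counts; the variable names are illustrative only.

```python
# Factor levels as described in the Design section.
subgroups, core_variables, densities, replicates = 4, 3, 3, 3
error_conditions, selection_methods, cluster_algorithms = 7, 10, 4

conditions = subgroups * core_variables * densities        # fully crossed cells
base_datasets = conditions * replicates                    # base datasets
datasets = base_datasets * error_conditions                # after error conditions
weighted_datasets = datasets * selection_methods           # after weighting/screening
clusterings = weighted_datasets * cluster_algorithms       # total cluster analyses

print(conditions, base_datasets, datasets, weighted_datasets, clusterings)
```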
Selection Algorithms 

Each data set was subjected to 10 variable weighting/selection strategies: 

A) No selection, 

B) $g_2 < -1.2$,

C) $g_2$ significantly < 0. This was determined at $\alpha = .05$, using the tables in Chen (1983),

D) $g_2 < 0$,

E) m < -1.2, 

F) m < 0, 

G) b > .555, 

H) b > .333, 

I) De Soete's (1986, 1988) ultrametric weighting algorithm, 

J) Fowlkes, Gnanadesikan, and Kettenring's (1988) forward selection algorithm. 
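
The threshold-based strategies in this list reduce to simple predicates on a variable's moment statistics. The sketch below covers only rules B and D through H; rule C needs the critical values tabled in Chen (1983), and methods I and J are algorithms rather than thresholds, so none of those is reproduced. The dictionary layout and the example values are hypothetical.

```python
# Each rule maps a dict of statistics (keys "g2", "m", "b") to pass/fail.
SCREENING_RULES = {
    "B": lambda s: s["g2"] < -1.2,
    "D": lambda s: s["g2"] < 0,
    "E": lambda s: s["m"] < -1.2,
    "F": lambda s: s["m"] < 0,
    "G": lambda s: s["b"] > 0.555,
    "H": lambda s: s["b"] > 0.333,
}


def passing_rules(stats):
    """Return the labels of the screening rules that a variable passes."""
    return sorted(label for label, rule in SCREENING_RULES.items() if rule(stats))


# Hypothetical symmetric variables: one strongly bimodal, one heavy-tailed.
bimodal = {"g2": -1.28, "m": -1.28, "b": 1 / 1.72}
heavy_tailed = {"g2": 1.08, "m": 1.08, "b": 1 / 4.08}

print(passing_rules(bimodal), passing_rules(heavy_tailed))
```

The cutoffs -1.2 and .555 are the values each statistic takes for a uniform distribution, while 0 and .333 are the values for a single normal distribution.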
Cluster Algorithms 

The 10 weighted versions of each dataset were then analyzed four times, corresponding 
to different hierarchical clustering algorithms and measures of similarity. The clustering 
methods were: (a) Single linkage, Euclidean distance; (b) Complete linkage, Euclidean distance; 
(c) Average linkage, Euclidean distance; (d) Ward's method (minimum variance), squared 
Euclidean distance. The clustering methods were chosen because they are widely used and 
average linkage and Ward's method have consistently performed well in previous studies. For a 
discussion of these algorithms, the reader is referred to standard introductions to cluster analysis
(e.g., Everitt, 1980; Lorr, 1983).
Outcome Measure 

The outcome measure for the study was the Hubert and Arabie (1985) modification of 
Rand's (1971) statistic, which will be denoted HA-Rand. The index was computed between each 
cluster solution and the true subgroup membership used to generate the data. This index is 
based on examining pairs of entities, and determining whether they are classified into the same 
or different subgroups. A value of zero reflects chance agreement with the true membership, 
and 1.0 reflects perfect agreement. A study by Milligan and Cooper (1986) supports the 
accuracy of Hubert and Arabie's modification. 
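
For concreteness, the chance-corrected index can be computed from the cross-classification of the two partitions. The following is a minimal sketch of the index as it is commonly stated (pair counts corrected for their expected value under chance); it is not the program used in the study, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Chance-corrected Rand index between two partitions of the same entities."""
    n = len(labels_a)
    # Pairs of entities placed together in both partitions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:        # degenerate case, e.g. one cluster in both
        return 1.0
    return (together_both - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]      # same partition, different labels
crossed = [0, 1, 0, 1]         # splits every true subgroup
print(adjusted_rand(truth, relabelled), adjusted_rand(truth, crossed))
```

Cluster labels are arbitrary, so only co-membership of pairs matters; the second example shows that agreement worse than chance yields a negative value.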
Computer Programs 

Data were generated using a modified version of Milligan's (1985) program. The 
weights for De Soete's algorithm were computed using his program OVWTRE (De Soete, 
1988).¹ The moment statistics, the Fowlkes, Gnanadesikan, and Kettenring (1988) forward
selection algorithm, and clustering algorithms were computed using FORTRAN programs 
written by the author. Accuracy of these programs was ensured through numerous comparisons 
of results of subroutines and final classifications with routines from SAS and SPLUS. 
Eigenvalues were computed using routines from EISPACK (Smith, Boyle, Garbow, Ikebe, 
Klema, & Moler, 1974). Note that the forward selection procedure in Fowlkes, Gnanadesikan, 
and Kettenring (1988) was developed in terms of the complete link clustering method. For the 
present study, the full method of determining expected values via Monte Carlo methods and 



The author is indebted to Glenn Milligan for providing a copy of the source code of his generation 
program and OVWTRE. 
then performing forward selection was applied to each of the clustering algorithms.

5. SIMULATION RESULTS 

Although ANOVA might seem a natural means to summarize the results, it was not used 
for the present study.² The primary independent variable of interest was the
weighting/selection method used in the analysis. The other factors in the design were included
to ensure that their effects were systematically present, and so would not confound the results
concerning variable weighting/selection. It is more meaningful to directly examine the
comparisons of interest, using multiple comparison procedures to control overall Type I error 
rate. Still, there may be interest in the main effects of the other variables in the study. These 
are summarized in Appendix Table A1. In general, the effects replicate those in other studies
(e.g., Milligan, 1980, 1989). 

The variable screening/weighting methods primarily were compared using a distribution 
free ordinal procedure, Cliff's (1993) method of comparing the order of two distributions.
Ordinal comparisons were performed using a modified version of Cliff's (1992) program
PAIRDEL1, for paired observations. Two types of ordinal hypotheses were assessed. The first,
based on the index $d_w$, is the proportion of datasets for which one method yielded higher
recovery than the other method minus the proportion for which it yielded lower recovery; it is
the net proportion of datasets with improved recovery. Negative values of $d_w$ indicate lower
recovery for the first method. The second ordinal procedure estimates the probability that a
randomly sampled observation from one distribution has a larger value than a randomly sampled 



2 In addition, a preliminary investigation of the within-cell means revealed substantial heterogeneity of 
variance, violating the ANOVA assumption, and making the ANOVA tests suspect. 
observation from the other distribution. This results in one of three decisions for each pair of
clustering methods: a) Method A is higher (better recovery) than Method B; b) Method B is 
higher than Method A; or c) the methods do not significantly differ. In addition, pairwise t-tests 
of means were also computed. 
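
The paired index described above (net proportion of datasets with improved recovery) is easy to compute as a point estimate. The sketch below does only that; it does not reproduce the inferential machinery of Cliff's method or the PAIRDEL1 program, and the recovery values are hypothetical.

```python
def paired_dominance(first, second):
    """Proportion of pairs where `first` is higher minus proportion where it is lower."""
    if len(first) != len(second):
        raise ValueError("paired comparison needs equal-length sequences")
    higher = sum(a > b for a, b in zip(first, second))
    lower = sum(a < b for a, b in zip(first, second))
    return (higher - lower) / len(first)


# Hypothetical HA-Rand values for two methods on the same four datasets.
method_a = [0.9, 0.8, 0.5, 0.4]
method_b = [0.7, 0.8, 0.6, 0.3]
print(paired_dominance(method_a, method_b))
```

Ties contribute to neither proportion, so the index depends only on the direction of each paired difference and not on its size.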

The results will be discussed in six sections. In Section 5.1, the minimal requirement of 
effectiveness for the proposed screening measures is examined: Does using the measure yield an
improvement over no screening? Measures which provide no improvement are certainly not
worth adopting. Next, Section 5.2 compares recovery using the variable screening to two 
suggestions from the literature: (a) variable weights to maximize agreement of the distances with 
the ultrametric inequality (De Soete's 1988 program); and (b) the forward selection procedure
of Fowlkes, Gnanadesikan, and Kettenring (1988). Section 5.3 looks at the robustness of the 
screening procedures; how do they perform in the presence of other types of error (perturbed 
coordinates and outliers)? Section 5.4 evaluates the effect of combining variable screening with 
ultrametric weights. Next, Section 5.5 examines the interaction of the best screening/weighting 
methods with clustering algorithms. Finally, Section 5.6 explores the effect of variable
standardization on the behavior of the forward selection algorithm. 
5.1 Effectiveness 

The minimal requirement of effectiveness is that using the variable screening/weighting 
method yield an improvement over using no selection. To address this issue, each method was 
compared to no selection. These comparisons were made on HA-Rand index values pooled 
over all datasets containing 0, 1, 2, or 3 error dimensions. Table 4 summarizes the results of 
these comparisons. 
Insert Table 4 about here



The three methods based on the kurtosis, $g_2 < -1.2$, $g_2$ signif. < 0, and $g_2 < 0$, yielded
significantly worse recovery than no selection. Negative $d_w$ values indicate that the kurtosis-
based measures resulted in lower HA-Rand values for 9-27% of the datasets. This result differs
sharply from the results of Bajgier and Aggarwal (1991), who found kurtosis to be the most 
powerful measure for detecting mixtures. However, Bajgier and Aggarwal only examined 
balanced mixtures, i.e., mixtures with equal mixing proportions and equal variances. To 
determine whether this accounted for the difference in findings, Figure 2 plots mean HA-Rand 
index results for no selection and for each of the kurtosis-based measures by subgroup size. 
Consistent with Bajgier and Aggarwal, the kurtosis-based measures function well for equal-sized 
subgroups. When the subgroups differ in size, however, these procedures do not function very 
well. Overall, therefore, kurtosis-based measures do not meet the basic test of effectiveness as
screening procedures and will not be discussed further. 



Insert Figure 2 about here 



As was noted above, unequal subgroup sizes induce skewness in the overall distributions. 
Thus, the measures which incorporate information about skewness, b and m, may be more 
useful. Table 4 reveals that all four of the screening methods involving b or m yield better 
recovery than no selection. In addition, neither the m-index nor b showed a large effect for 
subgroup size. The largest effect for subgroup size was a difference in HA-Rand index of 
approximately .06; for the kurtosis-based measures the effects ranged from .15 to .30. 

Weighting the variables to maximize agreement with the ultrametric inequality yielded 
significantly higher recovery than no selection. The forward selection method did not differ
from no selection in the ordinal comparisons. However, paired t-tests indicated that forward
selection did yield a significantly higher mean HA-Rand index than no selection. The
differences in these results will be examined in more detail in the next section.
5.2 Comparison with Other Weighting/Selection Methods

Pairwise comparisons of the 7 remaining methods³ were made on HA-Rand index values
for analyses of all datasets containing 0, 1, 2, or 3 error dimensions. Shaffer's (1986)
modification to the Bonferroni correction was used to maintain a familywise Type I error rate of
$\alpha = 0.05$. Finally, these pairwise relations were converted into ranks, based upon the number of
methods which were significantly higher than a given method versus the number of methods
which were significantly lower. These results are summarized in Table 5.

Insert Table 5 about here 

Table 5 also presents mean recovery for each number of error dimensions. When there 
are no error dimensions, m < 0 yields similar recovery to no selection, while m < -1.2 gives 
somewhat worse recovery. For screening based on the test of normality (m < 0), recovery is
somewhat affected by the addition of error dimensions, but less so than with no selection. On the
other hand, screening based on the uniform distribution (m < -1.2) yields similar results for all
numbers of error dimensions. Overall, however, in the presence of error dimensions both the
normal and uniform tests yield HA-Rand recovery values higher than those for no selection. 
The pattern of results for b, the coefficient of bimodality, is very similar to that for the m-index, 



Kurtosis-based methods are not discussed due to their poor performance in the comparison with no 
selection. 
although the normality test (b > .333) provides only minimal improvement over no selection.

In comparing the relative effectiveness of the screening measures with the ultrametric 
weights and forward selection procedures, Table 5 reveals that recovery using the ultrametric 
weights did not significantly differ from the screening methods based on a uniform distribution 
(b > .555 and m < -1.2), but outperformed both methods based on a normal distribution 
(b > .333 and m < 0). The paired t-test results indicated that screening based on m < 0 did 
not differ from the ultrametric weights, but screening based on b > .333 was still worse. Based 
on the ordinal comparisons, the forward selection method was found to yield lower cluster 
recovery than all four of the variable selection methods using b and m. However, based on the 
paired t-test results, the forward selection method is superior to selection based on b > .333, 
and did not differ from the other methods. 

A word is in order concerning discrepancies between the rank orders derived from the 
ordinal comparisons and those implied by paired t-tests of the means. Forward selection has a
noticeably higher mean than selection based on b > .333, yet the ordinal comparison indicates 
that recovery for forward selection is significantly lower. This seeming paradox points out the 
different questions addressed by the two comparisons. The ordinal method compares 
differences in direction, but the means take into account the size of those differences. 
Examination of the differences in the individual solutions confirms that b > .333 yields more 
cluster solutions with HA-Rand that is higher than forward selection than vice versa. However, 
forward selection occasionally produces a solution which is much better, giving forward selection 
a higher mean. Thus, both are legitimate answers to the question: Which method is better? 
5.3 Robustness to Other Types of Error

An additional issue in comparing the methods is their sensitivity to other types of error 
contaminating the cluster structure. Milligan's cluster generating program includes three
additional error conditions: (a) data perturbed by adding an error to each coordinate, low error 
variance; (b) data perturbed by adding an error to each coordinate, high error variance; and 
(c) including an additional 20% (i.e., 10 cases) which do not lie within the boundaries of any of 
the clusters (i.e., outliers and intermediates). These conditions will be referred to as, 
respectively, low error, high error, and outlier conditions. 

Table 6 gives the mean HA-Rand index and ranks based on ordinal pairwise 
comparisons of the methods for each of the error conditions. For all conditions, b > .555 and 
m < -1.2 yield much lower recovery than other methods, indicating that these selection methods 
degrade cluster recovery in the presence of error other than spurious variables. On the other 
hand, variable selection methods based on normality, b > .333 and m < 0, are relatively robust, 
and show little difference in cluster recovery from that of no selection. 

Insert Table 6 about here 

5.4 Combined Methods 

Variable selection based on the m-index and the coefficient of bimodality b are effective 
in reducing the effects of spurious dimensions. Variable weighting based on the ultrametric 
inequality is also effective. This section examines the effectiveness of combining the two 
strategies, selection and ultrametric weights. Variations of the forward selection method were 
not considered; forward selection is extremely computationally intensive, and combining the 
method with other techniques was not feasible for the purposes of this study. 

Each of the datasets was reanalyzed. First, one of four variable screening methods 
(m < -1.2, m < 0, b > .333, or b > .555) was applied. The variables passing the screening were 
then analyzed using De Soete's (1988) program to determine the variable weights. If no 
variables passed the screening, all variables were used. The weighted distances were then 
computed and analyzed by the clustering algorithms, as described in the Method section. 
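
The screening step, including the fallback to all variables when none pass, can be sketched as follows. This is a minimal illustration rather than the program used in the study. It assumes the forms m = g2 - g1^2 and b = (g1^2 + 1)/(g2 + 3), which reproduce the constant entries of Table 3, and it uses simple (uncorrected) sample moments; the function names `screening_stats` and `screen` and the example data are hypothetical.

```python
import numpy as np

def screening_stats(x):
    """Return (g1, g2, m, b) from uncorrected sample moments of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    g1 = np.mean(d ** 3) / m2 ** 1.5          # skewness
    g2 = np.mean(d ** 4) / m2 ** 2 - 3.0      # excess kurtosis
    return g1, g2, g2 - g1 ** 2, (g1 ** 2 + 1.0) / (g2 + 3.0)

def screen(data, rule="m", cutoff=0.0):
    """Indices of columns passing the screen; all columns if none pass."""
    keep = []
    for j in range(data.shape[1]):
        _, _, m, b = screening_stats(data[:, j])
        passed = (m < cutoff) if rule == "m" else (b > cutoff)
        if passed:
            keep.append(j)
    return keep if keep else list(range(data.shape[1]))

# Illustrative data (not from the study): a bimodal column and a peaked column.
bimodal = np.concatenate([np.linspace(-4, -2, 25), np.linspace(2, 4, 25)])
peaked = np.linspace(-1, 1, 50) ** 3
data = np.column_stack([bimodal, peaked])

kept_m = screen(data, rule="m", cutoff=0.0)            # screen on m < 0
kept_b = screen(data, rule="b", cutoff=0.555)          # screen on b > .555
fallback = screen(data[:, [1]], rule="m", cutoff=0.0)  # nothing passes: keep all
```

Only the retained columns would then be passed to the weighting program and the clustering algorithms.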

In order for the combination of weighting and screening to be effective, the results of the 
combined methods must be superior to both weighting alone and screening alone. Table 7 
summarizes comparisons of the combined methods to (a) screening alone and (b) using only the 
ultrametric weights. Although most of the methods show improvement, the combination of 
weights and screening based on m < -1.2 yielded worse recovery than did either method alone. 
The negative values of d_w and the ordinal z-test indicate lower recovery for the combined 
method, although the paired t-test indicates that the combined method yields a higher mean 
than does m < -1.2 alone. This pattern of results suggests that the combined method often 
yields somewhat lower recovery, but occasionally does much better than screening alone. 
Combining weighting with screening based on b > .555 yielded increased recovery in a net 2-3% 
of the datasets, and gives higher mean recovery, although the overall comparison of distributions 
does not significantly differ from screening alone. Finally, combining weighting with either of 
the two methods based on a normal distribution (m < 0 and b > .333) clearly improves 
recovery. 

Insert Table 7 about here 

Table 8 summarizes the results of applying the combined procedures to datasets with 0, 
1, 2, or 3 error dimensions. In addition, results for five additional methods (No Selection, 
ultrametric weights, forward selection, and the unweighted versions of m < -1.2, and b > .555) 
are repeated from Table 5. Overall, best recovery was obtained for m < 0 with weights and 
b > .555 with weights. Although the ordinal test indicated that the latter did not differ from the 
unweighted version, the mean for the combination method is much higher. The profile of 
means for b > .555 is impressive; there is virtually no effect of increasing from 0 to 3 error 
dimensions. On the other hand, m < 0 shows a modest effect of increasing error dimensions. 

Insert Table 8 about here 

Table 9 presents the results for the other error conditions: low error, high error, and 
20% outlier. The combined method based on b > .555 shows considerable sensitivity to the 
other types of error, and is uniformly among the three methods with the lowest recovery. The 
combined method based on m < 0 is much less sensitive to the other types of error, and does 
not differ from ultrametric weights only for the low error or high error conditions. Comparison 
with Table 6 reveals that the means are very similar to the unweighted version for these two 
conditions, although the combined method does appear to be somewhat affected by the presence 
of outliers. 

Insert Table 9 about here 

5.5 Interaction of Variable Screening with Clustering Methods 

An additional question of interest is whether the variable weighting/screening methods 
differed in usefulness for the different clustering algorithms. To address this issue, the mean for 
each clustering algorithm was computed for six of the eight methods listed in Table 6. Screening 
based on b > .555 and m < -1.2 were omitted. For each of the clustering algorithms, the 
omitted methods showed a very similar pattern to other screening methods, which also yielded 
higher recovery. Means for the average linkage algorithm are plotted in Figure 3. Results for 
Ward's method are shown in Figure 4. Recovery for the complete linkage algorithm is 
portrayed in Figure 5 and means for single linkage are plotted in Figure 6. 

Insert Figures 3 through 6 about here 

There were few large interactions between clustering algorithms and the variable 
weighting/screening methods. The notable exception is the behavior of forward selection for the 
single linkage algorithm. For the other algorithms forward selection appears to have little to 
recommend it; using forward selection with average linkage yields recovery which is uniformly 
lower than that of any other method, including no selection. On the other hand, the method 
gives uniformly high recovery when used with single linkage clustering, and is the best method 
for that algorithm. Other method by clustering algorithm interactions were relatively small. 

5.6 Behavior of the Forward Selection Method 

The relatively poor performance of the forward selection method of Fowlkes, 
Gnanadesikan, and Kettenring (1988) was unexpected. Results presented in their paper 
indicated that the method was very effective, if somewhat computationally intensive. The 
datasets in their study tended to have variables with similar variances. In this study, both within- 
group and overall variances were allowed to differ rather widely. The forward selection 
procedure standardizes each of the variables by its total variance in order to remove spurious 
scale effects from the computation of eigenvalues used in the selection. Milligan and Cooper 
(1988) and Barton (1993) have found that this method of standardization can adversely affect 
recovery by cluster methods. 
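
One way to see why dividing by the total variance can hurt is that the total variance of a variable already contains the between-cluster separation. The sketch below is an illustration added here under simplifying assumptions (two equal-sized clusters, a common within-cluster standard deviation); it is not the analysis of Milligan and Cooper (1988) or Barton (1993), and the function name is hypothetical.

```python
import math

def standardized_separation(delta, within_sd):
    """Separation of two equal-sized cluster means, in total-SD units.

    delta: distance between the two cluster means on one variable.
    within_sd: common within-cluster standard deviation.
    """
    between_var = (delta / 2.0) ** 2              # each mean lies delta/2 from the grand mean
    total_sd = math.sqrt(within_sd ** 2 + between_var)
    return delta / total_sd

weak = standardized_separation(2.0, 1.0)     # modest raw separation
strong = standardized_separation(10.0, 1.0)  # five times the raw separation
print(round(weak, 3), round(strong, 3))
```

Under these assumptions the standardized separation can never exceed 2, so a variable with very well separated clusters is compressed toward the same scale as a weakly separated or irrelevant one.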

Of the methods used in this study, only the forward selection procedure used 
standardized variables. To investigate whether this difference might have caused the unexpected 
performance of the forward selection method, the datasets were reanalyzed. Only those variables 
selected by the forward selection procedure were included in the analysis, but the variables were 
not standardized in forming the Euclidean distances. Table 10 compares results from this 
method with the standardized version of forward selection, no selection, ultrametric weights, and 
the two best combined methods for screening and weights. Forward selection was adversely 
affected by variable standardization; standardizing variables leads to lower recovery in almost 
9% more datasets than vice versa. Comparisons with other methods reveal that without 
standardizing variables, forward selection is superior to no selection, and ordinal comparisons 
indicate that it does not differ significantly from the other methods. Mean comparisons indicate 
marginally better recovery than ultrametric weighting alone and marginally worse recovery than 
screening based on b > .555 combined with weights. Means for each number of irrelevant 
dimensions are presented in Table 11. 

Insert Table 10 and Table 11 about here 

6. DISCUSSION 

Replicating the work of other authors, the inclusion of irrelevant dimensions was found 
to severely degrade cluster recovery. This paper examined the usefulness of moment-based 
univariate statistics to screen variables. Only variables which pass the screening are then used in 
the clustering. Results for screening based on the kurtosis measure g2 were very poor. For 
subgroups of equal size, g2 functioned fairly well, but did very poorly for unequal sized 
subgroups. Thus, it appears that the results of Bajgier and Aggarwal (1991) do not generalize, 
and screening based on g2 cannot be recommended for applied clustering. 

In contrast, screening based on the index m and on the coefficient of bimodality b 
functioned well. Both measures provided increased recovery over no selection and forward 
selection, and versions of each (m < -1.2 and b > .555) performed as well as the ultrametric 
weights. However, there is evidence that all of the weighting/selection methods are sensitive to 
types of error other than spurious dimensions. Selection based on b > .555 and m < -1.2 were 
most severely affected, although forward selection and ultrametric weights were also affected. 
Selection based on b > .333 was least affected, followed by m < 0. 

Combining variable screening with ultrametric weights performed very well. Two 
combinations specifically, m < 0 with weights and b > .555 with weights, showed improved 
cluster recovery in the presence of irrelevant dimensions. However, the combined methods 
(particularly b > .555 with weights) were sensitive to types of error other than irrelevant 
dimensions. The combined method based on m < 0 was better, although it does appear to be 
somewhat more sensitive to outliers than either screening alone or ultrametric weights alone. 
However, procedures have been developed to identify outliers prior to clustering (e.g., Barton, 
1991). The use of such procedures may further improve the performance of the combined 
methods. 

One limitation of the present study is that the overall sample size, excluding outliers, was 
held constant. It is possible that the various variable selection/weighting methods examined 
here may be dependent on this aspect of the data. Further work should examine the extent to 
which this is true. 

These results indicate that variable screening based on b and m are viable alternatives to 
both the ultrametric weighting method and the forward selection method. It is worthwhile to 
briefly consider the relative advantages and disadvantages of screening, compared to the other 
methods. The advantages of m and b are ease and speed of computation, ready availability, and 
potential applicability to a wide variety of clustering methods. Potential disadvantages include 
the large sampling variability of higher moments, the dependence on the mixture model of 
clustering, and the potential of univariate methods to fail to identify variables which, although 
individually providing little information, as a set yield large subgroup separation. 

6.1 Potential Advantages 

The measures m and b are based on the moment statistics, the skewness and the 
kurtosis. Thus, they are simple and quick to compute, and the amount of computation in 
examining a given dataset increases linearly with the number of entities and with the number of 
variables. In contrast, both ultrametric weights and forward selection are computationally 
intensive, making their use problematic for large datasets. Computation for the ultrametric 
weighting algorithm increases with the cube of the number of entities, while computation for the 
forward selection method increases with the square of the number of variables. Indeed, well 
over 95% of the computational effort of the simulation results reported here were devoted to 
the forward selection method. 

The components of m and b, the skewness measure g1 and the kurtosis measure g2, are widely 
available as standard descriptive statistics. Thus, these measures may be adopted easily by 
researchers. The ultrametric weights require alternating two multivariate optimization problems, 
a task which may well be beyond many applied researchers. The method is not widely available, 
i.e., in statistical packages, although De Soete (1988) has a program to compute the weights. 
The forward selection procedure is even harder to implement. Determining the expected value 
of the trace statistic requires drawing multiple multivariate samples, performing a MANOVA 
decomposition of the results of clustering each sample, and computing the resultant eigenvalues. 
This is only moderately demanding in an interactive statistical environment such as S-PLUS or 
GAUSS, a major undertaking in FORTRAN or C, and borders on herculean in a statistical 
package such as SPSS or BMDP. 

The measures m and b are suggested based on analysis of the mixture model of 
clustering. Thus, use of the measures is justified with other types of clustering algorithms, such 
as iterative partitioning algorithms (i.e., k-means) or direct application of finite mixture models, 
although the empirical utility of using the methods in these settings has yet to be established. 
The ultrametric weights, on the other hand, are closely tied to hierarchical clustering. The 
proofs by Johnson (1967) and Milligan (1979) specifically relate to hierarchical clustering. There 
is no logical reason to expect ultrametric weights to improve clustering by nonhierarchical 
algorithms, although it may be empirically found to be useful. The forward selection method is 
closely related to the normal mixture conception of clustering. Thus, its application is logically 
valid, although some operational details of the application of the method would need to be 
resolved. 

6.2 Potential Disadvantages 

The relationship of the variable screening measures to mixture models may also be a 
disadvantage. Ultrametric weighting may apply in other conceptions (e.g., graph-theoretic) of 
hierarchical clustering. In these cases, the mixture model conception may not make sense. The 
utility of m and b would have to be established empirically in such situations. Similarly, some 
applications of cluster analysis are inherently hierarchical (e.g., evolutionary biology), and again 
the use of the screening measures would have to be established empirically. 

Another potential disadvantage of the screening measures is their dependence on the 
third and fourth moments about the mean, which are rather poorly estimated in samples. This 
raises valid concern over the degree to which m and b may fluctuate simply due to sampling 
variation. The simulation results presented in Section 5 offer some encouragement along these 
lines. Each dataset contained a total sample of 50 entities, yet both measures were successful in 
screening irrelevant variables. Still, more knowledge about the variability of these screening 
measures would be helpful. 

Finally, a potential disadvantage in using univariate techniques such as m or b or the 
forward selection procedure is that a linear combination of two or more variables may provide 
good separation between subgroups, while neither of the marginal distributions reveals much 
separation. 4 There is a danger that the screening may drop such variables, and so lose 
information about the subgroup separation. It is unknown how the ultrametric weighting 
method would be affected by such a combination of variables. The forward selection procedure 
may be less prone to this type of behavior. It is based on a MANOVA test statistic, and is 
sensitive to subgroup separation based on linear combinations. However, if neither variable 
provides sufficient univariate separation for inclusion, the forward selection procedure will not 
detect that the pair provides good separation. It remains for future research to determine how 
adversely affected the weighting/selection methods are by such combinations of variables. 
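
The concern can be made concrete with a constructed example. The data below are artificial and deliberately extreme (the two subgroups differ only in the difference y - x, with no within-group spread in that direction), and the index is computed under the assumed form m = g2 - g1^2 with uncorrected sample moments; the sketch is an illustration added here, not part of the simulation design.

```python
import numpy as np

def m_index(x):
    """m = g2 - g1**2 from uncorrected sample moments (assumed form of the index)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    g1 = np.mean(d ** 3) / m2 ** 1.5
    g2 = np.mean(d ** 4) / m2 ** 2 - 3.0
    return g2 - g1 ** 2

u = np.linspace(-1.0, 1.0, 25)
base = 3.0 * u ** 3                             # peaked within-group scores
x = np.concatenate([base, base])                # identical in both subgroups
y = np.concatenate([base + 0.5, base - 0.5])    # subgroups differ only in y - x

m_x, m_y, m_diff = m_index(x), m_index(y), m_index(y - x)
print(round(m_x, 2), round(m_y, 2), round(m_diff, 2))
```

Here both x and y fail a screen of m < 0 and would be dropped, although the linear combination y - x separates the two subgroups perfectly.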



4 The author would like to thank an anonymous reviewer for pointing out this possibility. 

7. CONCLUSION 

Taken as a whole, the results of this work are promising. The screening measures m and 
b were successful in alleviating the deleterious effect of including irrelevant variables. Both 
measures provided increased recovery over no selection and forward selection, and versions of 
each performed as well as the use of ultrametric weights. The success of the combination of 
variable screening and ultrametric weights commends the use of these combined techniques; the 
combination yielded better recovery than either method separately in a net 2.4-6.7 percent of the 
datasets analyzed. However, it is not known to what extent these findings are sensitive to 
specific aspects of this study. This is especially true of the distribution of the irrelevant 
variables, and the structure of the subgroup separation. Clearly, more work along these lines is 
warranted. 




References 

Aldenderfer, M. S., & Blashfield, R. K. (1984). Cluster analysis. Sage University Paper Series 
on Quantitative Applications in the Social Sciences, 07-044. Beverly Hills: Sage 
Publications. 

Ali, M. M. (1974). Stochastic ordering and kurtosis measure. Journal of the American 
Statistical Association, 69, 543-545. 

Art, D., Gnanadesikan, R., & Kettenring, J. R. (1982). Data based metrics for cluster analysis. 
Utilitas Mathematica, 21A, 75-99. 

Bajgier, S. M., & Aggarwal, L. K. (1991). Powers of goodness-of-fit tests in detecting balanced 
mixed normal distributions. Educational and Psychological Measurement, 51, 253-269. 

Balanda, K. P., & MacGillivray, H. L. (1988). Kurtosis: A critical review. American Statistician, 
42, 111-119. 

Barton, R. M. (1991, April). Outlier detection in cluster analysis using weighted 
multidimensional scaling. Paper presented at the annual meeting of the American 
Educational Research Association, Chicago, IL. 

Barton, R. M. (1993, April). Standardizing variables in cluster analysis. Paper presented at the 
annual meeting of the American Educational Research Association, Atlanta, GA. 

Chang, W-C. (1983). On using principal components before separating a mixture of two 
multivariate normal distributions. Applied Statistics, 32, 267-275. 

Chen, W. W. S. (1983). On the comparison of some measures of kurtosis: A test of normality. 
American Statistical Association 1983 Proceedings of the Statistical Computing Section, 
pp. 217-222. 

Chissom, B. S. (1970). Interpretation of the kurtosis statistic. The American Statistician, 
24(10), 19-22. 

Cliff, N. (1992). PAIRDEL1.BAS: Program for computing matched-data d-statistics [computer 
program]. Los Angeles: Psychology Department, University of Southern California. 

Cliff, N. (1993). Dominance relations: Ordinal analyses to answer ordinal questions. 
Psychological Bulletin, 114, 494-509. 

Darlington, R. B. (1970). Is kurtosis really "peakedness"? The American Statistician, 24(2), 
19-20. 

De Soete, G. (1986). Optimal variable weighting for ultrametric and additive tree clustering. 
Quality and Quantity, 20, 169-180. 




De Soete, G. (1988). Software abstract - OVWTRE: A program for optimal variable weighting 
for ultrametric and additive tree fitting. Journal of Classification, 5, 101-104. 

De Soete, G., DeSarbo, W. S., & Carroll, J. D. (1985). Optimal variable weighting for 
hierarchical clustering: An alternating least-squares algorithm. Journal of Classification, 
2, 173-192. 

DeSarbo, W. S., Carroll, J. D., Clark, L. A., & Green, P. E. (1984). Synthesized clustering: A 
method for amalgamating alternative clustering bases with differential weighting of 
variables. Psychometrika, 49, 57-78. 

DeSarbo, W., Howard, D. J., & Jedidi, K. (1991). MULHCLUS: A new method for 
simultaneously performing multidimensional scaling and cluster analysis. Psychometrika, 
56, 121-136. 

Donoghue, J. R. (1987). Cluster analysis of learning disabled children. Unpublished master's 
thesis. California State University: Northridge, CA. 

Donoghue, J. R. (1993, April). A Monte Carlo study of the effects of within-group covariance 
structure on recovery in cluster analysis. Paper presented at the annual meeting of the 
American Educational Research Association, Atlanta, GA. 

Donoghue, J. R. (1994, April). Comparing the effectiveness of cluster analysis weighting 
procedures for within-group covariance structure: The bivariate case. Paper presented 
at the annual meeting of the American Educational Research Association, New Orleans, 
LA. 

Eisenberger, I. (1964). Genesis of bimodal distributions. Technometrics, 6, 357-363. 

Everitt, B. S. (1980). Cluster analysis (2nd ed.). London: Halstead Press. 

Finucan, H. M. (1964). A note on kurtosis. Journal of the Royal Statistical Society, Series B, 
26, 111-112. 

Fisher, R. A. (1939). Statistical methods for research workers. London: Oliver and Boyd. 

Fleiss, J. L., & Zubin, J. (1969). On the methods and theory of clustering. Multivariate 
Behavioral Research, 4, 235-250. 

Fletcher, J. M., & Satz, P. (1985). Cluster analysis and the search for learning disabilities 
subtypes. In B. P. Rourke (Ed.), Neuropsychology of learning disabilities: Essentials of 
subtypes analysis (pp. 40-64). New York: Guilford. 

Fowlkes, E. B., Gnanadesikan, R., & Kettenring, J. R. (1988). Variable selection in clustering. 
Journal of Classification, 5, 205-228. 




Hartigan, J. A. (1975). Clustering algorithms. New York: John Wiley and Sons. 

Hildebrand, D. K. (1971). Kurtosis measures bimodality? The American Statistician, 25(2), 
42-43. 

Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193-218. 

Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-254. 

Johnson, M. E., Tietjen, G. L., & Beckman, R. J. (1980). A new family of distributions with 
applications to Monte Carlo studies. Journal of the American Statistical Association, 75, 
276-279. 

Kaplansky, I. (1945). A common error concerning kurtosis. Journal of the American Statistical 
Association, 40, 259. 

Lorr, M. (1983). Cluster analysis for social scientists. San Francisco: Jossey-Bass Publishers. 

Marsaglia, G., Marshall, A. W., & Proschan, F. (1965). Moment crossings as related to density 
crossings. Journal of the Royal Statistical Society, Series B, 27, 91-93. 

McLachlan, G. J., & Basford, K. E. (1988). Mixture models: Inference and applications to 
clustering. New York: Marcel Dekker. 

Meehl, P. E. (1979). A funny thing happened to us on the way to the latent entities. Journal of 
Personality Assessment, 43, 564-577. 

Milligan, G. W. (1979). Ultrametric hierarchical clustering algorithms. Psychometrika, 44, 
343-346. 

Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on 
fifteen clustering algorithms. Psychometrika, 45, 325-342. 

Milligan, G. W. (1985). An algorithm for generating artificial test clusters. Psychometrika, 50, 
123-127. 

Milligan, G. W. (1989). A validation study of variable weighting algorithm for cluster analysis. 
Journal of Classification, 6, 53-71. 

Milligan, G. W., & Cooper, M. C. (1986). A study of comparability of external criteria for 
hierarchical cluster analysis. Multivariate Behavioral Research, 21, 441-458. 

Milligan, G. W., & Cooper, M. C. (1988). A study of standardization of variables in cluster 
analysis. Journal of Classification, 5, 181-204. 




Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the 
American Statistical Association, 66, 846-850. 

Rohlf, F. J. (1970). Adaptive hierarchical clustering schemes. Systematic Zoology, 19, 58-82. 

SAS (1985). SAS user's guide: Statistics, version 5 edition. Cary, NC: SAS Institute, Inc. 

Shaffer, J. P. (1986). Modified sequentially rejective multiple test procedures. Journal of the 
American Statistical Association, 81, 826-831. 

Smith, B. T., Boyle, J. M., Garbow, B. S., Ikebe, Y., Klema, V. C., & Moler, C. B. (1974). 
Matrix eigensystem routines: EISPACK guide. New York: Springer-Verlag. 

Stuart, A., & Ord, J. K. (1987). Kendall's advanced theory of statistics: Vol 1. Distributional 
theory (5th ed.). London: Charles W. Griffin & Co. 

Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite mixture 
distributions. New York: Wiley. 




Table 2 

Screening Measures for Selected Two and Three Subgroup Mixtures 

[Most of the table body is not legible in this copy. Surviving entries: -.664, .295; and, for 
the row labeled 1, 1, 4, .3 .4 .3: .359, -.008.] 

a μ1 = 1, μ2 is determined such that the overall mean = 0; b μ1 = 1, μ2 = 0, μ3 is determined such 
that the overall mean = 0. 
Table 3 

Expected Values of Screening Measures for Selected Distributions 

Distribution       g1                         g2                             m           b 
Bernoulli (p)      (1 - 2p)/sqrt(p(1 - p))    1/(p(1 - p)) - 6               -2          1 
Binomial (n, p)    (1 - 2p)/sqrt(np(1 - p))   (1 - 6p + 6p^2)/(np(1 - p))    -2/n        (1 - 4p + 4p^2 + np - np^2)/(1 - 6p + 6p^2 + 3np - 3np^2) 
Geometric (p)      (2 - p)/sqrt(1 - p)        6 + p^2/(1 - p)                2           (5 - 5p + p^2)/(9 - 9p + p^2) 
Poisson (m)        sqrt(m)/m                  1/m                            0           (m + 1)/(3m + 1) 
Exponential (λ)    2                          6                              2           .556 
Normal (μ, σ^2)    0                          0                              0           .333 
Chi-square (k)     2 sqrt(2k)/k               12/k                           4/k         (k + 8)/(3k + 12) 
t (n)              0                          6/(n - 4)                      6/(n - 4)   (n - 4)/(3n - 6) 
Uniform (a, b)     0                          -1.2                           -1.2        .556 
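
As a check on the tabled entries, the constant rows can be reproduced by hand. The worked arithmetic below assumes the forms m = g2 - g1^2 and b = (g1^2 + 1)/(g2 + 3); these are inferred here because they reproduce the tabled values, and should be read against the definitions given in the body of the paper.

```latex
\begin{align*}
\text{Exponential: } & m = 6 - 2^2 = 2,        & b &= \frac{2^2 + 1}{6 + 3} = \frac{5}{9} \approx .556 \\
\text{Normal: }      & m = 0 - 0^2 = 0,        & b &= \frac{0^2 + 1}{0 + 3} = \frac{1}{3} \approx .333 \\
\text{Uniform: }     & m = -1.2 - 0^2 = -1.2,  & b &= \frac{0^2 + 1}{-1.2 + 3} = \frac{1}{1.8} \approx .556
\end{align*}
```

Under these assumed forms the exponential and uniform distributions share the same value of b, while m distinguishes them (2 versus -1.2).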






Table 6 

[Table title and rank columns are not legible in this copy; -- marks an illegible entry. 
Values in parentheses are as printed beneath each mean.] 

Method                 Low Error      High Error     20% Outlier 
No Selection           .934 (.216)    --             .972 (.146) 
b > .333               .936 (.214)    .841 (.294)    .959 (.178) 
m < 0                  .923 (.237)    .819 (.313)    .962 (.151) 
Ultrametric Weights    .922 (.209)    .827 (.285)    .956 (.158) 
Forward Selection      .941 (.159)    .849 (.226)    .941 (.172) 
b > .555               .899 (.219)    .750 (.302)    .873 (.256) 
m < -1.2               .886 (.244)    .721 (.298)    .894 (.204) 











Table 8 

[Table title, both rank columns (ordinal method and paired t-tests), and many cell entries are 
not legible in this copy; -- marks an illegible entry. Rows for the same methods also appear 
in Table 11.] 

                       Number of Error Dimensions 
Method                 0       1       2       3       Overall Mean 
m < 0, weights         --      .963    --      --      -- 
b > .555, weights      --      --      --      .962    -- 
b > .555               --      --      --      .939    -- 
b > .333, weights      --      --      --      .929    -- 
Ultrametric Weights    --      .937    --      --      -- 
m < -1.2               --      --      --      .920    .933 
Forward Selection      .966    --      --      .920    -- 
No Selection           .984    .917    --      --      -- 




Table 11

Comparison of Forward Selection Using Unstandardized Variables with Other Methods

                                  Number of Error Dimensions
                                                                               Overall   Ordinal          Paired t-test
Method                   0             1             2             3           Mean      Ranks            Ranks

m < 0, weights           [illegible]   [illegible]   [illegible]   [illegible]   .954    [illegible]      [illegible]
b > .555, weights        [illegible]   [illegible]   [illegible]   [illegible]   .962    [illegible]      1 g
b > .555                 .942          .942          .939          .939          .941    2 abc            5 [illegible]
Forward Selection,
  Unstandardized         .986          .948          .942          .936          .953    4 [illegible]    3 ghi
b > .333, weights        .967          .954          .940          .929          .948    4 [illegible]    4 h
Ultrametric Weights      .962          .937          .938          .926          .941    [illegible]      6 [illegible]
m < -1.2                 .948          .937          .930          .920          .933    [illegible]      7 j
Forward Selection,
  Standardized           .966          .927          .925          .920          .934    8 f              7 j
No Selection             .984          .917          .877          .852          .908    1 gf             9

Methods with common superscripts do not significantly differ from one another. Ranks are based
on the number of methods that were significantly lower.
Appendix

Let X be distributed as a finite mixture composed of G distributions with mixing
proportions $\pi_j$ ($j = 1, \ldots, G$), i.e.:

$$f(X) = \sum_{j=1}^{G} \pi_j f_j(X).$$

We will be particularly interested in the normal mixture model, in which each of the subgroup
distributions $f_j(X)$ is $N(\mu_j, \sigma_j^2)$. Let $\mu$ be the grand mean of X (i.e., $\mu = \sum_{j} \pi_j \mu_j$). Let $M_k$ be the
kth central moment for the mixture, and let $M_{kj}$ be the kth central moment for subgroup j:
$M_{kj} = E_j (X - \mu_j)^k$. Finally, by the linearity of the expectation operator, note that:

$$E[f(X)] = \sum_{j=1}^{G} \pi_j E_j[f(X)],$$

where $E_j$ is the expectation with respect to the distribution of subgroup j.
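As a rough numerical illustration of the mixture definition, the following Python sketch evaluates a two-subgroup normal mixture density and checks, by trapezoid integration, that it integrates to one and that its mean equals the weighted sum of the subgroup means. The parameter values and all function names are illustrative assumptions, not quantities taken from the simulation study.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, pis, mus, sigmas):
    """f(x) = sum_j pi_j f_j(x) for normal subgroup densities."""
    return sum(p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas))

def grand_mean(pis, mus):
    """mu = sum_j pi_j mu_j."""
    return sum(p * m for p, m in zip(pis, mus))

def integrate(func, lo, hi, steps=20000):
    """Composite trapezoid rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * h)
    return total * h

# Hypothetical two-subgroup example (values chosen only for illustration).
PIS = [0.6, 0.4]
MUS = [0.0, 3.0]
SIGMAS = [1.0, 0.5]

# The integration limits are far enough into both tails that truncation is negligible.
total_mass = integrate(lambda x: mixture_pdf(x, PIS, MUS, SIGMAS), -10.0, 13.0)
numeric_mean = integrate(lambda x: x * mixture_pdf(x, PIS, MUS, SIGMAS), -10.0, 13.0)
```

Here `total_mass` should be numerically indistinguishable from 1, and `numeric_mean` should agree with `grand_mean(PIS, MUS)`, which is the linearity property applied to the identity function.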

Kurtosis 

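For a mixture, the fourth central moment of the mixture distribution is:

$$M_4 = \sum_{j=1}^{G} \pi_j E_j (X - \mu)^4 = \sum_{j=1}^{G} \pi_j E_j \big( (X - \mu_j) + (\mu_j - \mu) \big)^4 .$$

Expanding the binomial and taking expectations yields:

$$M_4 = \sum_{j=1}^{G} \pi_j M_{4j} + 4 \sum_{j=1}^{G} \pi_j M_{3j} (\mu_j - \mu) + 6 \sum_{j=1}^{G} \pi_j \sigma_j^2 (\mu_j - \mu)^2 + \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^4 ,$$

since $E_j (X - \mu_j) = 0$ and $E_j (X - \mu_j)^2 = \sigma_j^2$. Similarly, the second moment is:

$$M_2 = \sum_{j=1}^{G} \pi_j \sigma_j^2 + \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^2 . \tag{A}$$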
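The two moment expressions can be evaluated directly from the subgroup parameters. The sketch below assumes normal subgroups, so that the third subgroup central moment is zero and the fourth is three times the squared variance; the function name `mixture_moments` is an illustrative choice.

```python
def mixture_moments(pis, mus, sigmas):
    """Grand mean, M_2 and M_4 of a normal mixture.

    Assumes each subgroup is normal, so M_3j = 0 and M_4j = 3 * sigma_j**4.
    """
    mu = sum(p * m for p, m in zip(pis, mus))
    m2 = 0.0
    m4 = 0.0
    for p, m, s in zip(pis, mus, sigmas):
        d = m - mu        # offset of the subgroup mean from the grand mean
        var = s * s       # within-subgroup variance
        m2 += p * (var + d ** 2)
        m4 += p * (3.0 * var ** 2 + 6.0 * var * d ** 2 + d ** 4)
    return mu, m2, m4

# Two equal-sized subgroups with unit variance and means at -1 and +1.
mu, m2, m4 = mixture_moments([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

For this example the grand mean is 0, the second moment is 2 (within-subgroup variance 1 plus between-subgroup variance 1), and the fourth moment is 10.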
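Using (A) and the expansion of $M_4$, the kurtosis of the mixture is:

$$g_2 = \frac{M_4}{M_2^2} - 3 = \frac{\sum_{j=1}^{G} \pi_j M_{4j} + 4 \sum_{j=1}^{G} \pi_j M_{3j} (\mu_j - \mu) + 6 \sum_{j=1}^{G} \pi_j \sigma_j^2 (\mu_j - \mu)^2 + \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^4}{\left[ \sum_{j=1}^{G} \pi_j \sigma_j^2 + \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^2 \right]^2} - 3 .$$

Let $k = g_2 M_2^2$. Then expanding and collecting like terms yields:

$$\begin{aligned}
k &= \sum_{j=1}^{G} \pi_j M_{4j} + 4 \sum_{j=1}^{G} \pi_j M_{3j} (\mu_j - \mu) + 6 \sum_{j=1}^{G} \pi_j \sigma_j^2 (\mu_j - \mu)^2 + \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^4 \\
  &\quad - 3 \Big( \sum_{j=1}^{G} \pi_j \sigma_j^2 \Big)^2 - 6 \Big( \sum_{j=1}^{G} \pi_j \sigma_j^2 \Big) \Big( \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^2 \Big) - 3 \Big[ \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^2 \Big]^2 \\
  &= \sum_{j=1}^{G} \pi_j M_{4j} - 3 \sum_{j=1}^{G} \pi_j \sigma_j^4 + 4 \sum_{j=1}^{G} \pi_j M_{3j} (\mu_j - \mu) - 2 \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^4 \\
  &\quad + 3\,\mathrm{var}[\sigma_j^2] + 6\,\mathrm{cov}[\sigma_j^2, (\mu_j - \mu)^2] + 3\,\mathrm{var}[(\mu_j - \mu)^2] ,
\end{aligned} \tag{B}$$

where the variances and the covariance are taken across subgroups with weights $\pi_j$.

The above expression (B) is valid for any distribution which has the first four moments. Making
explicit use of the fact that the subgroups are normal, $M_{4j} = 3\sigma_j^4$ and $M_{3j} = 0$ for all j. Making
these substitutions:

$$k = 3\,\mathrm{var}[\sigma_j^2 + (\mu_j - \mu)^2] - 2 \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^4$$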
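and hence:

$$g_2 = \frac{3\,\mathrm{var}[\sigma_j^2 + (\mu_j - \mu)^2] - 2 \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^4}{\left[ \sum_{j=1}^{G} \pi_j \sigma_j^2 + \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^2 \right]^2} . \tag{C}$$

Skewness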



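The skewness of a variable is:

$$g_1 = \frac{M_3}{M_2^{3/2}} = \frac{\sum_{j=1}^{G} \pi_j M_{3j} + 3 \sum_{j=1}^{G} \pi_j \sigma_j^2 (\mu_j - \mu) + \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^3}{\left[ \sum_{j=1}^{G} \pi_j \sigma_j^2 + \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^2 \right]^{3/2}} . \tag{D}$$

Assuming within-group normality, this reduces to:

$$g_1 = \frac{3 \sum_{j=1}^{G} \pi_j \sigma_j^2 (\mu_j - \mu) + \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^3}{\left[ \sum_{j=1}^{G} \pi_j \sigma_j^2 + \sum_{j=1}^{G} \pi_j (\mu_j - \mu)^2 \right]^{3/2}} . \tag{E}$$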
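The closed forms for a normal mixture can be checked numerically. The following Python sketch evaluates the kurtosis from the variance-based expression (C) and the skewness from (E), and compares the kurtosis with a direct evaluation of the fourth moment divided by the squared second moment, minus 3. The function names and the example parameters are illustrative assumptions; variances are taken across subgroups with the mixing proportions as weights.

```python
def weighted_mean(pis, values):
    """Mean of values across subgroups, weighted by mixing proportions."""
    return sum(p * v for p, v in zip(pis, values))

def weighted_var(pis, values):
    """Variance of values across subgroups, weighted by mixing proportions."""
    mean = weighted_mean(pis, values)
    return sum(p * (v - mean) ** 2 for p, v in zip(pis, values))

def kurtosis_c(pis, mus, sigmas):
    """Kurtosis of a normal mixture via the variance-based closed form (C)."""
    mu = weighted_mean(pis, mus)
    d2 = [(m - mu) ** 2 for m in mus]
    spread = [s ** 2 + d for s, d in zip(sigmas, d2)]
    m2 = weighted_mean(pis, spread)
    k = 3.0 * weighted_var(pis, spread) - 2.0 * weighted_mean(pis, [d ** 2 for d in d2])
    return k / m2 ** 2

def skewness_e(pis, mus, sigmas):
    """Skewness of a normal mixture via the closed form (E)."""
    mu = weighted_mean(pis, mus)
    m2 = weighted_mean(pis, [s ** 2 + (m - mu) ** 2 for m, s in zip(mus, sigmas)])
    m3 = weighted_mean(
        pis, [3.0 * s ** 2 * (m - mu) + (m - mu) ** 3 for m, s in zip(mus, sigmas)]
    )
    return m3 / m2 ** 1.5

def kurtosis_direct(pis, mus, sigmas):
    """Kurtosis from the expanded fourth moment, with M_3j = 0 and M_4j = 3 sigma_j^4."""
    mu = weighted_mean(pis, mus)
    m2 = weighted_mean(pis, [s ** 2 + (m - mu) ** 2 for m, s in zip(mus, sigmas)])
    m4 = weighted_mean(
        pis,
        [3.0 * s ** 4 + 6.0 * s ** 2 * (m - mu) ** 2 + (m - mu) ** 4
         for m, s in zip(mus, sigmas)],
    )
    return m4 / m2 ** 2 - 3.0

# Equal-sized subgroups, unit variances, means at -1 and +1.
g2_equal = kurtosis_c([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])   # -0.5
g1_equal = skewness_e([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])   # 0.0
```

In the equal-variance, equal-size example the variance term in (C) vanishes, so the kurtosis is negative (here -0.5) and the skewness is zero. For unequal subgroup variances or sizes the variance term is positive and can offset the negative term, which is why the sign of the kurtosis is not guaranteed.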




Table A1

Main Effects for Design Factors in the Simulation Study

Number of Subgroups          Mean    Std.

2                            .848    .348
                             .925    .189
                             .915    .197
                             .876    .219

Number of Core Variables

Number of Error Dimensions

Subgroup Sizes

Equal
    .153
60% in one subgroup

Clustering Algorithm

Average Linkage
    .222
Ward's Method
                             .871    .270